Authors: Mannat Singh, Quentin Duval, Kalyan Vasudev Alwala, Haoqi Fan, Vaibhav Aggarwal, Aaron Adcock, Armand Joulin, Piotr Dollár, Christoph Feichtenhofer, Ross Girshick, Rohit Girdhar, Ishan Misra
This paper revisits the standard pretrain-then-finetune paradigm used in computer vision for visual recognition tasks. Typically, state-of-the-art foundation models are pretrained using large-scale (weakly) supervised datasets with billions of images. We introduce an additional pre-pretraining stage that is simple and uses the self-supervised MAE technique to initialize the model. While MAE has only been shown to scale with the size of models, we find that it scales with the size of the training dataset as well. Thus, our MAE-based pre-pretraining scales with both model and data size, making it applicable for training foundation models. Pre-pretraining consistently improves both model convergence and downstream transfer performance across a range of model scales (millions to billions of parameters) and dataset sizes (millions to billions of images). We measure the effectiveness of pre-pretraining on 10 different visual recognition tasks spanning image classification, video recognition, object detection, low-shot classification, and zero-shot recognition. Our largest model achieves new state-of-the-art results on iNaturalist-18 (91.3%), 1-shot ImageNet-1k (62.1%), and zero-shot transfer on Food-101 (96.0%). Our study reveals that model initialization plays a significant role, even for web-scale pretraining with billions of images.
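To make the three-stage recipe described in the abstract concrete, below is a minimal PyTorch sketch of the pipeline: (1) self-supervised MAE-style pre-pretraining, (2) weakly supervised pretraining initialized from that encoder, and (3) downstream fine-tuning. This is not the authors' implementation (which uses ViT encoders and billions of weakly labeled images); the tiny MLP encoder, random-tensor stand-ins for data, masking ratio, step counts, and learning rates are all placeholder assumptions for illustration only.

```python
# Schematic sketch of the pre-pretraining -> pretraining -> fine-tuning recipe.
# All model sizes, data, and hyperparameters below are illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

D_IN, D_HID, N_WEAK_CLASSES, N_TASK_CLASSES = 768, 512, 1000, 10

# Encoder shared across all three stages (a real setup would use a ViT).
encoder = nn.Sequential(nn.Linear(D_IN, D_HID), nn.GELU(), nn.Linear(D_HID, D_HID))

# Stage 1: MAE-style pre-pretraining -- mask most of the input and train the
# encoder (plus a lightweight decoder) to reconstruct the masked content.
decoder = nn.Linear(D_HID, D_IN)
opt = torch.optim.AdamW(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)
for _ in range(100):                                  # placeholder step count
    x = torch.randn(32, D_IN)                         # stands in for patchified images
    mask = (torch.rand_like(x) < 0.75).float()        # ~75% masking, as in MAE
    recon = decoder(encoder(x * (1 - mask)))
    loss = F.mse_loss(recon * mask, x * mask)         # loss only on masked parts
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: weakly supervised pretraining, initialized from the MAE encoder.
weak_head = nn.Linear(D_HID, N_WEAK_CLASSES)
opt = torch.optim.AdamW(list(encoder.parameters()) + list(weak_head.parameters()), lr=1e-4)
for _ in range(100):
    x = torch.randn(32, D_IN)
    y = torch.randint(0, N_WEAK_CLASSES, (32,))       # stands in for weak (hashtag-like) labels
    loss = F.cross_entropy(weak_head(encoder(x)), y)
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 3: fine-tune (or transfer) the pretrained encoder on a downstream task.
task_head = nn.Linear(D_HID, N_TASK_CLASSES)
opt = torch.optim.AdamW(list(encoder.parameters()) + list(task_head.parameters()), lr=1e-5)
for _ in range(100):
    x = torch.randn(32, D_IN)
    y = torch.randint(0, N_TASK_CLASSES, (32,))
    loss = F.cross_entropy(task_head(encoder(x)), y)
    opt.zero_grad(); loss.backward(); opt.step()
```

The key point the sketch conveys is that stage 2 starts from the MAE-initialized encoder rather than from random weights; everything else in the pipeline is unchanged.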
Paper link: http://arxiv.org/pdf/2303.13496v1
More computer science papers: http://cspaper.cn/